Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash

In Part 4, we looked into some crucial sample pre-processing steps before modeling, establishing the required pipeline for data processing, evaluating various algorithms, and ultimately identifying an appropriate baseline model, that is CatBoost. As we proceed to Part 5, our focus will be on assessing the significance of features in the initial scraped dataset. We’ll achieve this by employing the feature_importances_ method of CatBoostRegressor and analyzing SHAP values. Additionally, we’ll systematically eliminate features showing lower importance or predictive capability. Excited to delve into this phase!

Note

You can access the project’s app through its Streamlit website.

Import data

Code

from pathlib import Path

import catboost
import numpy as np
import pandas as pd
import shap
from data import pre_process, utils
from IPython.display import clear_output
from lets_plot import *
from lets_plot.mapping import as_discrete
from models import train_model
from sklearn import metrics, model_selection
from tqdm import tqdm

LetsPlot.setup_html()

Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)

Prepare dataframe before modelling

Read in dataframe

Code

df = pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "2023-10-01_Processed_dataset_for_NB_use.parquet.gzip"
    )
)

Train-test split

Our initial step involves executing a train-test split. This process divides the data into two distinct sets: a training set and a testing set. The training set is employed for model training, while the testing set is exclusively reserved for model evaluation. This methodology allows models to be trained on the training set and then assessed for accuracy using the unseen testing set. Ultimately, this approach enables an unbiased evaluation of our model’s performance, utilizing the test set that remained untouched during the model training phase.

Code

train, test = model_selection.train_test_split(
    df, test_size=0.2, random_state=utils.Configuration.seed
)

print(f"Shape of train: {train.shape}")
print(f"Shape of test: {test.shape}")

Shape of train: (2928, 56)
Shape of test: (732, 56)

Preprocess dataframe for modelling

Our first step entails removing features that lack informative value for our model, including ‘external_reference’, ‘ad_url’, ‘day_of_retrieval’, ‘website’, ‘reference_number_of_the_epc_report’, and ‘housenumber’. After this, we’ll proceed to apply a log transformation to our target variable. Furthermore, we’ll address missing values in categorical features by replacing them with the label “missing value.” This step is crucial as CatBoost can handle missing values in numerical columns, but for categorical missing values, user intervention is needed.

Code

processed_train = (
    train.reset_index(drop=True)
    .assign(price=lambda df: np.log10(df.price))  # Log transformation of 'price' column
    .drop(columns=utils.Configuration.features_to_drop)
)

# This step is needed since catboost cannot handle missing values when feature is categorical
for col in processed_train.columns:
    if processed_train[col].dtype.name in ("bool", "object", "category"):
        processed_train[col] = processed_train[col].fillna("missing value")

processed_train.shape

(2928, 50)

Inspect feature importance

To evaluate feature importance, we’re utilizing two methods: the feature_importances_ attribute in CatBoost and SHAP values from the SHAP library. To begin examining feature importances, we’ll initiate model training. This involves further partitioning the training set, reserving a portion for CatBoost training (validation dataset). This segmentation allows us to stop the training process when overfitting emerges. Preventing overfitting is crucial, as it ensures we don’t work with an overly biased model. Additionally, if overfitting occurs, stopping training earlier helps conserve time and resources.

Code

features = processed_train.columns[~processed_train.columns.str.contains("price")]

numerical_features = processed_train.select_dtypes("number").columns.to_list()
categorical_features = processed_train.select_dtypes("object").columns.to_list()

train_FS, validation_FS = model_selection.train_test_split(
    processed_train, test_size=0.2, random_state=utils.Configuration.seed
)

# Get target variables
tr_y = train_FS[utils.Configuration.target_col]
val_y = validation_FS[utils.Configuration.target_col]

# Get feature matrices
tr_X = train_FS.loc[:, features]
val_X = validation_FS.loc[:, features]


print(f"Train dataset shape: {tr_X.shape} {tr_y.shape}")
print(f"Validation dataset shape: {val_X.shape} {val_y.shape}")

Train dataset shape: (2342, 49) (2342,)
Validation dataset shape: (586, 49) (586,)

Code

train_dataset = catboost.Pool(tr_X, tr_y, cat_features=categorical_features)
validation_dataset = catboost.Pool(val_X, val_y, cat_features=categorical_features)

As you can see below, the training loss steadily decreases, but the validation loss reaches a plateau. Thanks to the validation dataset, we can halt the training well before the initially set 2000 iterations. This early cessation is crucial for preventing overfitting and ensures a more balanced and effective model.

Code

model = catboost.CatBoostRegressor(
    iterations=2000,
    random_seed=utils.Configuration.seed,
    loss_function="RMSE",
)

model.fit(
    train_dataset,
    eval_set=[validation_dataset],
    early_stopping_rounds=20,
    use_best_model=True,
    verbose=2000,
    plot=True,
)

Learning rate set to 0.038132
0:  learn: 0.3187437    test: 0.3148628 best: 0.3148628 (0) total: 186ms    remaining: 6m 12s
Stopped by overfitting detector  (20 iterations wait)

bestTest = 0.1087393065
bestIteration = 850

Shrink model to first 851 iterations.

<catboost.core.CatBoostRegressor at 0x240921b25c0>

According to our trained CatBoost model, the most significant feature in our dataset is the living_area, followed by cadastre_income and latitude. To validate and compare these findings, we’ll examine the SHAP values to understand how they align or differ from the feature importances provided by the model.

Code

(
    pd.concat(
        [pd.Series(model.feature_names_), pd.Series(model.feature_importances_)], axis=1
    )
    .sort_values(by=1, ascending=False)
    .rename(columns={0: "name", 1: "importance"})
    .reset_index(drop=True)
    .pipe(
        lambda df: ggplot(df, aes("name", "importance"))
        + geom_bar(stat="identity")
        + labs(
            title="Assessing Feature Importance",
            subtitle=""" based on the feature_importances_ attribute
            """,
            x="",
            y="Feature Importance",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
        + ggsize(800, 600)
    )
)

Figure 1: Assessing Feature Importance

SHAP

SHAP (SHapley Additive exPlanations) values offer a method to elucidate the predictions made by any machine learning model. These values leverage a game-theoretic approach to quantify the contribution of each “player” (feature) to the final prediction. In the realm of machine learning, SHAP values assign an importance value to each feature, delineating its contribution to the model’s output.

SHAP values provide detailed insights into how each feature influences individual predictions, their relative significance in comparison to one another, and the model’s reliance on interactions between features. This comprehensive perspective enables a deeper understanding of the factors that drive the model’s decision-making process.

In this phase, our primary focus will involve computing SHAP values and then creating visualizations, such as bar plots and beeswarm plots, to illustrate feature importance and interactionss.

Code

shap.initjs()
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(
    catboost.Pool(tr_X, tr_y, cat_features=categorical_features)
)

The summary plot at Figure 2 provides a clear depiction of feature importance within the model. The outcomes reveal that “Living area,” “Surface of the plot,” and “Cadastral income” emerge as pivotal factors in influencing the model’s predictions. These features prominently contribute to determining the model’s results.

Code

shap.summary_plot(shap_values, tr_X, plot_type="bar", plot_size=[12, 6])

Figure 2: Assessing Feature Importance using bar plot

The beeswarm plot (Figure 3) can be best understood through the following breakdown:

The Y-axis arranges the feature names based on their importance, with the most influential features placed at the top.
The X-axis displays the SHAP value, representing the impact of each feature on the model’s output. Features on the right side of the plot exhibit a stronger impact, while those on the left possess a weaker influence.

Moreover: - Each point’s color on the plot signifies the respective feature’s value for that specific data point. Red indicates high values, while blue represents low values. - Every individual point on the plot corresponds to a particular row from the original dataset. Collectively, these points illustrate how different data points and their associated feature values contribute to the model’s predictions, especially concerning feature importance.

Upon examining the “living area” feature, you’ll notice it predominantly displays a high positive SHAP value. This suggests that a larger living area tends to have a positive effect on the output, which in this context is the price. Conversely, higher values of “primary energy consumption” are associated with a negative impact on the price, reflected by their negative SHAP values.

Consideration of the spread of SHAP values and their relation to predictive power is important. A wider spread or a denser distribution of data points implies greater variability or a more significant influence on the model’s predictions. This insight allows us to evaluate the significance of features regarding their contribution to the model’s overall output.

This context clarifies why “living area” holds greater importance compared to “CO2 emission.” The broader impact and higher variability of the “living area” feature in influencing the model’s predictions make it a more crucial determinant of the output, thus carrying more weight in the model’s decision-making process.rtance.

Code

shap.summary_plot(shap_values, tr_X)

No data for colormapping provided via 'c'. Parameters 'vmin', 'vmax' will be ignored

Figure 3: Assessing Feature Importance using beeswarm plot

Let’s examine the ranking or order of feature importances derived from both Gini impurity and SHAP values to understand how they compare and whether they yield similar or differing insights. As you can see from the table below, they are fairly similar.

Code

catboost_feature_importance = (
    pd.concat(
        [pd.Series(model.feature_names_), pd.Series(model.feature_importances_)], axis=1
    )
    .sort_values(by=1)
    .rename(columns={0: "catboost_name", 1: "importance"})
    .reset_index(drop=True)
)

Code

shap_feature_importance = (
    pd.DataFrame(shap_values, columns=tr_X.columns)
    .abs()
    .mean()
    .sort_values()
    .reset_index()
    .rename(columns={"index": "shap_name", 0: "shap"})
)

Code

pd.concat(
    [
        catboost_feature_importance.drop(columns="importance"),
        shap_feature_importance.drop(columns="shap"),
    ],
    axis=1,
)

	catboost_name	shap_name
0	flood_zone_type	flood_zone_type
1	dining_room	planning_permission_obtained
2	proceedings_for_breach_of_planning_regulations	proceedings_for_breach_of_planning_regulations
3	planning_permission_obtained	street_frontage_width
4	furnished	furnished
5	connection_to_sewer_network	double_glazing
6	possible_priority_purchase_right	dining_room
7	street_frontage_width	possible_priority_purchase_right
8	as_built_plan	connection_to_sewer_network
9	double_glazing	heating_type
10	subdivision_permit	bedroom_1_surface
11	basement	subdivision_permit
12	gas_water__electricity	available_as_of
13	heating_type	as_built_plan
14	tenement_building	garden_surface
15	street	tenement_building
16	bedroom_3_surface	bedroom_2_surface
17	bedroom_1_surface	bedroom_3_surface
18	bedroom_2_surface	latest_land_use_designation
19	garden_surface	basement
20	office	gas_water__electricity
21	tv_cable	width_of_the_lot_on_the_street
22	width_of_the_lot_on_the_street	living_room_surface
23	latest_land_use_designation	tv_cable
24	kitchen_surface	street
25	available_as_of	kitchen_surface
26	living_room_surface	office
27	co2_emission	surroundings_type
28	surroundings_type	outdoor_parking_spaces
29	outdoor_parking_spaces	co2_emission
30	covered_parking_spaces	bedrooms
31	postal	covered_parking_spaces
32	bedrooms	postal
33	energy_class	construction_year
34	primary_energy_consumption	lng
35	construction_year	energy_class
36	lng	toilets
37	number_of_frontages	primary_energy_consumption
38	toilets	kitchen_type
39	yearly_theoretical_total_energy_consumption	number_of_frontages
40	kitchen_type	building_condition
41	building_condition	lat
42	city	yearly_theoretical_total_energy_consumption
43	surface_of_the_plot	city
44	bathrooms	bathrooms
45	state	state
46	lat	cadastral_income
47	cadastral_income	surface_of_the_plot
48	living_area	living_area

Recursive feature elimination based on SHAP values

Next, we’ll work on the initial feature elimination process based on SHAP values using CatBoost’s select_features method. Although a rich set of features can be advantageous, the quest for model interpretability prompts us to consider the need for a streamlined feature set.

Our objective here is to remove features that have minimal impact on the final predictive output, retaining only the most influential ones. This action streamlines our model, enhancing its interpretability and making it easier to comprehend the factors driving its predictions.s.

Code

regressor = catboost.CatBoostRegressor(
    iterations=1000,
    cat_features=categorical_features,
    random_seed=utils.Configuration.seed,
    loss_function="RMSE",
)

rfe_dict = regressor.select_features(
    algorithm="RecursiveByShapValues",
    shap_calc_type="Exact",
    X=tr_X,
    y=tr_y,
    eval_set=(val_X, val_y),
    features_for_select="0-48",
    num_features_to_select=1,
    steps=10,
    verbose=250,
    train_final_model=False,
    plot=True,
)

Learning rate set to 0.0582
Step #1 out of 10
0:  learn: 0.3152318    test: 0.3115119 best: 0.3115119 (0) total: 22.3ms   remaining: 22.3s
250:    learn: 0.0875657    test: 0.1143850 best: 0.1143850 (250)   total: 6.63s    remaining: 19.8s
500:    learn: 0.0676185    test: 0.1092529 best: 0.1092412 (497)   total: 13.3s    remaining: 13.2s
750:    learn: 0.0567661    test: 0.1072066 best: 0.1072004 (749)   total: 20.2s    remaining: 6.71s
999:    learn: 0.0474957    test: 0.1066737 best: 0.1065423 (961)   total: 27.2s    remaining: 0us

bestTest = 0.1065422651
bestIteration = 961

Shrink model to first 962 iterations.
Feature #38 eliminated
Feature #28 eliminated
Feature #20 eliminated
Feature #22 eliminated
Feature #19 eliminated
Feature #14 eliminated
Feature #11 eliminated
Feature #0 eliminated
Feature #31 eliminated
Feature #15 eliminated
Feature #30 eliminated
Feature #41 eliminated
Feature #33 eliminated
Feature #17 eliminated
Feature #45 eliminated
Feature #18 eliminated
Step #2 out of 10
0:  learn: 0.3155660    test: 0.3112354 best: 0.3112354 (0) total: 26ms remaining: 26s
250:    learn: 0.0887768    test: 0.1128068 best: 0.1127729 (248)   total: 7.14s    remaining: 21.3s
500:    learn: 0.0702646    test: 0.1078201 best: 0.1078179 (499)   total: 14.2s    remaining: 14.1s
750:    learn: 0.0590862    test: 0.1064625 best: 0.1064480 (746)   total: 21.5s    remaining: 7.14s
999:    learn: 0.0505860    test: 0.1056381 best: 0.1056078 (984)   total: 28.5s    remaining: 0us

bestTest = 0.1056077628
bestIteration = 984

Shrink model to first 985 iterations.
Feature #39 eliminated
Feature #2 eliminated
Feature #21 eliminated
Feature #34 eliminated
Feature #4 eliminated
Feature #6 eliminated
Feature #1 eliminated
Feature #24 eliminated
Feature #9 eliminated
Feature #35 eliminated
Feature #5 eliminated
Step #3 out of 10
0:  learn: 0.3150998    test: 0.3120655 best: 0.3120655 (0) total: 26.3ms   remaining: 26.2s
250:    learn: 0.0878914    test: 0.1132226 best: 0.1132226 (250)   total: 6.59s    remaining: 19.7s
500:    learn: 0.0692551    test: 0.1082705 best: 0.1082664 (499)   total: 12.7s    remaining: 12.6s
750:    learn: 0.0578095    test: 0.1068464 best: 0.1068464 (750)   total: 18.5s    remaining: 6.13s
999:    learn: 0.0495254    test: 0.1058574 best: 0.1058574 (999)   total: 24.9s    remaining: 0us

bestTest = 0.1058574346
bestIteration = 999

Feature #26 eliminated
Feature #37 eliminated
Feature #29 eliminated
Feature #16 eliminated
Feature #12 eliminated
Feature #13 eliminated
Feature #7 eliminated
Step #4 out of 10
0:  learn: 0.3142786    test: 0.3109611 best: 0.3109611 (0) total: 26.1ms   remaining: 26.1s
250:    learn: 0.0905292    test: 0.1146754 best: 0.1146491 (249)   total: 6.14s    remaining: 18.3s
500:    learn: 0.0714765    test: 0.1117756 best: 0.1117756 (500)   total: 11.4s    remaining: 11.4s
750:    learn: 0.0602824    test: 0.1109848 best: 0.1109848 (750)   total: 17s  remaining: 5.65s
999:    learn: 0.0519507    test: 0.1108987 best: 0.1107855 (849)   total: 22.8s    remaining: 0us

bestTest = 0.1107855404
bestIteration = 849

Shrink model to first 850 iterations.
Feature #46 eliminated
Feature #23 eliminated
Feature #27 eliminated
Feature #40 eliminated
Feature #43 eliminated
Step #5 out of 10
0:  learn: 0.3143922    test: 0.3107735 best: 0.3107735 (0) total: 22.2ms   remaining: 22.1s
250:    learn: 0.0934222    test: 0.1193447 best: 0.1193447 (250)   total: 5.43s    remaining: 16.2s
500:    learn: 0.0748984    test: 0.1172378 best: 0.1172378 (500)   total: 11.1s    remaining: 11.1s
750:    learn: 0.0624621    test: 0.1166952 best: 0.1166448 (743)   total: 16.4s    remaining: 5.44s
999:    learn: 0.0542961    test: 0.1163795 best: 0.1163795 (999)   total: 21.9s    remaining: 0us

bestTest = 0.1163795207
bestIteration = 999

Feature #48 eliminated
Feature #32 eliminated
Feature #3 eliminated
Step #6 out of 10
0:  learn: 0.3148243    test: 0.3111962 best: 0.3111962 (0) total: 19.3ms   remaining: 19.2s
250:    learn: 0.1038084    test: 0.1239657 best: 0.1239657 (250)   total: 5.26s    remaining: 15.7s
500:    learn: 0.0861433    test: 0.1229114 best: 0.1228575 (493)   total: 10.3s    remaining: 10.3s
750:    learn: 0.0751895    test: 0.1228634 best: 0.1225339 (626)   total: 15.5s    remaining: 5.14s
999:    learn: 0.0674641    test: 0.1229819 best: 0.1225339 (626)   total: 20.6s    remaining: 0us

bestTest = 0.1225338768
bestIteration = 626

Shrink model to first 627 iterations.
Feature #42 eliminated
Feature #36 eliminated
Step #7 out of 10
0:  learn: 0.3147967    test: 0.3113498 best: 0.3113498 (0) total: 19.5ms   remaining: 19.4s
250:    learn: 0.1205611    test: 0.1354225 best: 0.1354225 (250)   total: 6.42s    remaining: 19.2s
500:    learn: 0.1039180    test: 0.1343591 best: 0.1341680 (459)   total: 12.3s    remaining: 12.3s
750:    learn: 0.0935743    test: 0.1354581 best: 0.1341680 (459)   total: 17.3s    remaining: 5.73s
999:    learn: 0.0856311    test: 0.1359971 best: 0.1341680 (459)   total: 23.1s    remaining: 0us

bestTest = 0.1341679745
bestIteration = 459

Shrink model to first 460 iterations.
Feature #8 eliminated
Feature #44 eliminated
Step #8 out of 10
0:  learn: 0.3148282    test: 0.3113391 best: 0.3113391 (0) total: 3.58ms   remaining: 3.58s
250:    learn: 0.1370998    test: 0.1535179 best: 0.1531225 (197)   total: 494ms    remaining: 1.47s
500:    learn: 0.1214322    test: 0.1553576 best: 0.1531225 (197)   total: 946ms    remaining: 942ms
750:    learn: 0.1107183    test: 0.1568469 best: 0.1531225 (197)   total: 1.29s    remaining: 427ms
999:    learn: 0.1023147    test: 0.1583956 best: 0.1531225 (197)   total: 1.78s    remaining: 0us

bestTest = 0.1531224776
bestIteration = 197

Shrink model to first 198 iterations.
Feature #47 eliminated
Step #9 out of 10
0:  learn: 0.3155241    test: 0.3116638 best: 0.3116638 (0) total: 2.69ms   remaining: 2.69s
250:    learn: 0.1768252    test: 0.1824890 best: 0.1823469 (181)   total: 335ms    remaining: 1s
500:    learn: 0.1659879    test: 0.1851122 best: 0.1823469 (181)   total: 661ms    remaining: 659ms
750:    learn: 0.1590130    test: 0.1875015 best: 0.1823469 (181)   total: 1.2s remaining: 399ms
999:    learn: 0.1531420    test: 0.1895159 best: 0.1823469 (181)   total: 1.54s    remaining: 0us

bestTest = 0.1823469242
bestIteration = 181

Shrink model to first 182 iterations.
Feature #10 eliminated
Step #10 out of 10
0:  learn: 0.3166921    test: 0.3133613 best: 0.3133613 (0) total: 1.65ms   remaining: 1.65s
250:    learn: 0.2114648    test: 0.2165803 best: 0.2164558 (240)   total: 469ms    remaining: 1.4s
500:    learn: 0.2075425    test: 0.2186096 best: 0.2164558 (240)   total: 783ms    remaining: 780ms
750:    learn: 0.2059289    test: 0.2203856 best: 0.2164558 (240)   total: 1.11s    remaining: 368ms
999:    learn: 0.2051774    test: 0.2217136 best: 0.2164558 (240)   total: 1.65s    remaining: 0us

bestTest = 0.2164557771
bestIteration = 240

Shrink model to first 241 iterations.

Through Recursive Feature Elimination, we’ve successfully decreased the number of features from an initial count of 49 to a more concise set of 17, up to and including the “bedrooms” feature. This reduction in features hasn’t notably affected our model’s performance, enabling us to retain a comparable level of predictive accuracy. This streamlined dataset enhances the model’s simplicity and interpretability without compromising its effectiveness.

Code

features_to_keep = (
    rfe_dict["eliminated_features_names"][33:] + rfe_dict["selected_features_names"]
)

print(features_to_keep)

['bedrooms', 'state', 'kitchen_type', 'number_of_frontages', 'toilets', 'street', 'lng', 'primary_energy_consumption', 'bathrooms', 'yearly_theoretical_total_energy_consumption', 'surface_of_the_plot', 'building_condition', 'city', 'lat', 'cadastral_income', 'living_area']

That’s it for now. In the next post, our focus will be on identifying potential outliers within our dataset. Additionally, we’ll delve into several further feature engineering steps aimed at bolstering our model’s performanceFor this, we will useof cross-validation, ensuring the robustness and reliability of our modes. Looking forward to the next steps!